<!DOCTYPE html>

<html>

  <head>
    <title>Underactuated Robotics: Policy
  Search</title>
    <meta name="Underactuated Robotics: Policy
  Search" content="text/html; charset=utf-8;" />
    <link rel="canonical" href="http://underactuated.mit.edu/policy_search.html" />

    <script src="https://hypothes.is/embed.js" async></script>
    <script type="text/javascript" src="htmlbook/book.js"></script>

    <script src="htmlbook/mathjax-config.js" defer></script> 
    <script type="text/javascript" id="MathJax-script" defer
      src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js">
    </script>
    <script>window.MathJax || document.write('<script type="text/javascript" src="htmlbook/MathJax/es5/tex-chtml.js" defer><\/script>')</script>

    <link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
    <script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
    <script>hljs.initHighlightingOnLoad();</script>

    <link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
  </head>

<body onload="loadChapter('underactuated');">

<div data-type="titlepage">
  <header>
    <h1><a href="index.html" style="text-decoration:none;">Underactuated Robotics</a></h1>
    <p data-type="subtitle">Algorithms for Walking, Running, Swimming, Flying, and Manipulation</p> 
    <p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
    <p style="font-size: 14px; text-align: right;"> 
      &copy; Russ Tedrake, 2020<br/>
      <a href="tocite.html">How to cite these notes</a> &nbsp; | &nbsp;
      <a target="_blank" href="https://docs.google.com/forms/d/e/1FAIpQLSesAhROfLRfexrRFebHWLtRpjhqtb8k_iEagWMkvc7xau08iQ/viewform?usp=sf_link">Send me your feedback</a><br/>
    </p>
  </header>
</div>

<p><b>Note:</b> These are working notes used for <a
href="http://underactuated.csail.mit.edu/Spring2020/">a course being taught
at MIT</a>. They will be updated throughout the Spring 2020 semester.  <a 
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture  videos are available on YouTube</a>.</p> 

<table style="width:100%;"><tr style="width:100%">
  <td style="width:33%;text-align:left;"><a class="previous_chapter" href=feedback_motion_planning.html>Previous Chapter</a></td>
  <td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
  <td style="width:33%;text-align:right;"><a class="next_chapter" href=robust.html>Next Chapter</a></td>
</tr></table>


<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 12"><h1>Policy
  Search</h1>

  <p>So far, most of our recommendations for control design have been
  relatively "local" -- leveraging trajectory planning/optimization as a tool
  and our ability to locally stabilize trajectories for even very complex
  systems using linear optimal control.  This is in stark contrast to the
  dynamic programming / value iteration methods that we started with, which
  attempt to solve for a control policy for every possible state;
  unfortunately, the dynamic programming methods as presented are restricted to
  relatively low-dimensional state spaces.  What is missing so far are
  algorithms for synthesizing feedback controllers that scale to large state
  spaces and produce controllers that are, hopefully, less "local" than
  trajectory stabilization.</p>

  <p>In this chapter, we will explore another very natural idea: let us
  parameterize a controller with some decision variables, and then search over
  those decision variables directly in order to achieve a task and/or optimize
  a performance objective.  We'll refer to this broad class of methods as
  "policy search" or, when optimization methods are used, "policy
  optimization". </p>

  <section><h1>Problem formulation</h1>

    <p>Consider a static full-state feedback policy, $$\bu =
    \bpi_\balpha(\bx),$$ where $\bpi$ is potentially a nonlinear function, and
    $\balpha$ is the vector of parameters that describe the controller.  The
    controller might take time as an input, or might even have its own internal
    state, but let's start with this simple form.  </p>
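
    <p>As a minimal sketch of what such a parameterization could look like in
    code, consider linear state feedback, $\bu = -{\bf K}\bx$, where $\balpha$
    simply stores the entries of the gain matrix.  The function name
    <code>linear_policy</code>, the row-by-row layout of $\balpha$, and the
    numerical values are assumptions made for illustration only; any other
    differentiable function of $\bx$ and $\balpha$ would fit the same
    interface.</p>

<pre><code class="python">import numpy as np

def linear_policy(alpha, x, num_inputs):
    """Evaluate u = -K x, where alpha stores the entries of K row by row."""
    K = np.reshape(alpha, (num_inputs, x.size))
    return -K @ x

# Hypothetical example: two states and one input, so alpha has two entries.
alpha = np.array([1.0, 2.0])
x = np.array([0.5, -1.0])
u = linear_policy(alpha, x, num_inputs=1)
</code></pre>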

    <p>How should we write an objective function for optimizing $\balpha$?  The
    approach that we used for trajectory optimization is quite reasonable --
    the objective was typically to minimize an integral cost over some time
    horizon (be it finite or infinite).  But in trajectory optimization, the
    cost is only ever defined based on forward simulation from a single initial
    condition.  We used the same additive cost structures in dynamic
    programming, where the Hamilton-Jacobi-Bellman equation provided optimality
    conditions for optimizing an additive cost from <i>every</i> initial
    condition; at least in the idealized equations, we were able to get away
    with saying $\forall \bx, \minimize_\bu ...$. </p>

    <p>But now we are playing a different game.  If we are searching over
    some finitely parameterized policy, $\bpi_{\balpha}$, we can almost never
    expect to be optimal for every state -- and we need to somehow define the
    relative importance of different states.  For finite-horizon problems, a
    natural choice is a distribution over initial conditions.  For
    infinite-horizon problems, what really matters is the stationary
    distribution (which depends on the policy).  Let's start with the
    distribution over initial conditions.</p>
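
    <p>One way to make this concrete is to write the objective as an expected
    finite-horizon cost, $$\min_\balpha \; E_{\bx_0 \sim \mathcal{D}_0} \left[
    \int_0^T \ell(\bx(t), \bpi_\balpha(\bx(t))) \, dt \right],$$ where the
    running cost $\ell$, the horizon $T$, and the initial-condition
    distribution $\mathcal{D}_0$ are notation introduced here for
    illustration.  The sketch below estimates this expectation by Monte Carlo
    sampling.  Everything specific in it is an assumption: the plant is a
    double integrator, the policy is linear, the cost is quadratic, the
    initial conditions are drawn from a unit Gaussian, and the rollouts use
    forward Euler integration.</p>

<pre><code class="python">import numpy as np

def rollout_cost(alpha, x0, horizon=5.0, dt=0.01):
    """Approximate the integral of x'x + u^2 along one closed-loop rollout."""
    x = np.array(x0, dtype=float)
    cost = 0.0
    for _ in range(int(round(horizon / dt))):
        u = -alpha @ x                      # assumed linear policy, u = -alpha' x
        cost += (x @ x + u * u) * dt        # rectangle-rule quadrature of the cost
        x = x + dt * np.array([x[1], u])    # forward Euler step of qddot = u
    return cost

def expected_cost(alpha, num_samples=100, seed=0):
    """Average the rollout cost over initial conditions drawn from N(0, I)."""
    rng = np.random.default_rng(seed)
    x0_samples = rng.standard_normal((num_samples, 2))
    return float(np.mean([rollout_cost(alpha, x0) for x0 in x0_samples]))

# Compare a damped policy against one with no velocity feedback.
J_damped = expected_cost(np.array([1.0, 1.0]))
J_undamped = expected_cost(np.array([1.0, 0.0]))
</code></pre>

    <p>Because both calls use the same seed, the two policies are evaluated on
    the same sampled initial conditions, which makes the comparison between
    parameter vectors less noisy than it would be with independent
    samples.</p>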

  </section>

  <section><h1>Controller parameterizations</h1>

    <p>Searching directly for $K$ with an LQR objective is known to be a
    difficult optimization problem. The objective is non-convex in $K$, and the
    set of stabilizing controllers is not a convex set. (TODO: Give the 2D
    example)</p>
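
    <p>The non-convexity of the set of stabilizing gains can be checked
    numerically on a small case.  The sketch below is an illustrative example
    chosen for this purpose, not necessarily the one these notes have in mind:
    we assume the continuous-time system $\dot\bx = \bu$ with two states and
    two inputs, and the feedback $\bu = -K\bx$.  The two gains
    <code>K1</code> and <code>K2</code> each place both closed-loop
    eigenvalues at $-1$, but their midpoint has closed-loop eigenvalues $-3$
    and $+1$, so it is not stabilizing.</p>

<pre><code class="python">import numpy as np

# Assumed plant for illustration: xdot = A x + B u with A = 0 and B = I.
A = np.zeros((2, 2))
B = np.eye(2)

def is_stabilizing(K):
    """True if every eigenvalue of A - B K has a negative real part."""
    eigenvalues = np.linalg.eigvals(A - B @ K)
    return bool(np.less(eigenvalues.real, 0.0).all())

K1 = np.array([[1.0, 4.0], [0.0, 1.0]])
K2 = np.array([[1.0, 0.0], [4.0, 1.0]])
K_mid = 0.5 * (K1 + K2)   # a convex combination of two stabilizing gains

print(is_stabilizing(K1), is_stabilizing(K2), is_stabilizing(K_mid))
</code></pre>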

  </section>

  <section><h1>Trajectory-based policy search</h1>


  </section>

  <section><h1>Lyapunov-based approaches to policy
    search</h1>


  </section>

  <section><h1>Approximate Dynamic Programming</h1>

  </section>

  <!-- TODO:
  Guided Policy Search, etc (see Robert V's review);
  PILCO and PIPPS (arxiv from Doya).
  -->

</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->

<table style="width:100%;"><tr style="width:100%">
  <td style="width:33%;text-align:left;"><a class="previous_chapter" href=feedback_motion_planning.html>Previous Chapter</a></td>
  <td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
  <td style="width:33%;text-align:right;"><a class="next_chapter" href=robust.html>Next Chapter</a></td>
</tr></table>

<div id="footer">
  <hr>
  <table style="width:100%;">
    <tr><td><em>Underactuated Robotics</em></td><td align="right">&copy; Russ
      Tedrake, 2020</td></tr>
  </table>
</div>


</body>
</html>
